feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845
feat(e2e_eval): add --build-only mode with per-EP matrix, export dedup, and Azure Artifacts upload#845KayMKM wants to merge 13 commits into
Conversation
Add a --build-only mode to run_eval.py that runs config + build with --no-compile, writing each pipeline stage's ONNX (export/optimize/quantize) without requiring execution-provider hardware. Perf and accuracy are skipped. When --ep/--device are omitted, every model is built once per EP in the build-only matrix (qnn npu/gpu, openvino cpu/npu/gpu, mlas, dml, vitisai) into <model_dir>/<ep>_<device>/ subdirs. When either is pinned, a single build is written directly into the model dir. Precision per combo reuses the existing _resolve_precision policy (NPU w8a16, CPU/GPU auto, native-quant EPs unquantized). Reuses the existing _run_build via a build_only flag (-o <dir> --no-compile instead of --use-cache).
Two bugs surfaced when running `run_eval.py --build-only` against the EP matrix on a CPU-only host: 1. Every combo for the 'no native EP' subset (mlas/dml/openvino) was reported as `[FAIL @ complete]` even though export/optimize/quantize/model.onnx all landed correctly. `_run_build` was funnelling build-only results through `_extract_onnx_path`, which scans stdout for a `Final artifact:` marker that `winml build --no-compile` never prints, and falls back to the global cache which build-only doesn't populate (`-o <dir>` writes elsewhere). In build-only mode there is no downstream consumer of the path, so trust exit-code 0 directly and record `build_out` to keep the per-component bookkeeping balanced. 2. QNN/VitisAI combos failed at the optimize stage with `Requested EP 'qnn' is not available on this system`. `_run_optimize_stage` calls `resolve_device(device, ep=ep)` purely to pick the right `has_rule_data_for_ep` key for the progress bar, but that helper raises when the EP isn't installed locally -- even when the rest of the pipeline (export + optimize + quantize) runs on CPU and the EP is only needed at compile time. Soft-fail the lookup *only when* `config.compile is None` (i.e. `--no-compile` or a config that explicitly opts out); otherwise re-raise so configs that will compile still fail fast here instead of deep inside the compile stage. Also moves `--clean-cache` from per-combo to per-model in `_run_build_only`: combos for the same model share the same HF download, so clearing between combos forced N redundant re-downloads of the same weights.
…facts feed Running --build-only over the 8-EP matrix for many models fills local disk. Two additions keep disk bounded: 1. Export dedup: the export.onnx stage is EP/device-independent, so every combo produces an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to <model_dir>/_shared/, later identical ones are deleted. One export copy on disk instead of 8. 2. --upload: after a model's combos are built, publish the model dir to the Modelkit Azure Artifacts feed as a Universal Package version, then delete it locally. Auth via az login (no PAT); the azure-devops extension is ensured and login verified up front (aborts otherwise so disk isn't silently filled). Version is 0.0.0-<run-stamp>-<model-slug> (valid SemVer 2.0; date stamp groups a batch). --continue + --run-stamp resume an interrupted batch from the build_only_uploads.json manifest without rebuilding uploaded models; --keep-local, --upload-skip-existing, and feed/package args round it out.
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
Reviewed the full diff. Overall structure is solid and docstrings are thorough. Found a potential data-loss path (false-positive conflict detection triggers local directory deletion), a silent no-op for --continue in the non-upload build-only path, and a few other correctness/usability issues — see inline comments.
… versions The --continue skip logic only consulted the local build_only_uploads.json manifest, which is written after each successful upload. A fresh --output-dir (e.g. a gitignored temp dir) starts empty, so models already published to the Azure Artifacts feed under the same run-stamp were rebuilt and re-uploaded instead of being skipped. Seed the in-memory manifest from the feed at startup: query the feed REST API for versions matching 0.0.0-<run-stamp>-* and mark them as uploaded so the existing skip check honors them. The feed is now authoritative for what's published, regardless of local state. Querying is best-effort -- a failure falls back to local-manifest-only behavior. Use two ampersand-free REST GETs (list packages -> resolve UPack package GUID -> list versions) because az resolves to az.cmd and cmd.exe splits query strings on '&', dropping every parameter after the first.
- _hash_files: stop hashing unreadable files to a fixed sentinel; propagate OSError and have _dedup_export keep the export in place instead of risking deletion of an artifact never verified identical. - _is_publish_conflict: narrow detection to specific version-exists / HTTP 409 markers (drop bare 'conflict'/'409') so an unrelated message can't trigger exists-skipped and rmtree the local model dir. - build.py _run_optimize_stage: narrow the no-compile EP fallback to only swallow EP-not-available ValueErrors; re-raise malformed device/EP names. - Warn when --continue is used with --build-only but without --upload (no local-disk resume exists, so everything is rebuilt). - Document that the pinned-EP auto-device path delegates precision to winml config's auto-detection. - Fix misleading --upload-skip-existing help: it does not skip the build.
When a mid-run upload failed because Azure CLI was unavailable (not logged in, token expired, or az hung and was killed), the model was marked 'failed' and kept locally while the batch continued. Every subsequent model hit the same az failure and its local copy piled up, filling the disk (a single 7B LLM matrix is ~450 GB). Add _is_az_unavailable() to distinguish a host-level az/login problem from a per-package publish error (network blip or version conflict), and abort the run immediately (exit 3) when an upload fails for that reason. The user re-runs 'az login' and resumes with --continue + the same --run-stamp; already-uploaded models are skipped.
Uploading all 8 EP/device combos as one Universal Package version timed out on large models and left the local artifacts behind, filling the disk. Each combo is now built, uploaded, and deleted on its own. - Version is per combo: 0.0.0-<run-stamp>-<ep>-<device>-<model-slug>, so each package is small (lower timeout risk) and can be retried/resumed on its own. - Local artifacts are removed after every outcome (uploaded, version-exists, timeout, upload-failed, build-failed) unless --keep-local, so disk stays bounded. A timeout/failure is recorded and the run continues; only a host-level az failure (not logged in / token expired) aborts. - Every (model, combo) outcome is written to build_only_results.json (replaces build_only_uploads.json) for auditing, with or without --upload; it also drives per-combo --continue resume. - Export dedup now applies only without --upload (each uploaded combo is self-contained). Adds _classify_upload helper and unit tests for the version format, results I/O, az-unavailable classification, and the per-combo cleanup/timeout/abort orchestration.
…atrix # Conflicts: # src/winml/modelkit/commands/build.py
DingmaomaoBJTU
left a comment
There was a problem hiding this comment.
v2 re-review: the previous five issues are all addressed -- OSError propagation in _hash_files, narrow conflict markers, --continue warning, per-combo upload design, and the build.py ep_unavailable guard. Remaining findings: one correctness bug (pagination truncation in _fetch_feed_versions silently breaks --continue resume on large feeds) and two lower-severity concerns.
- _fetch_feed_versions: return None (not an empty set) when the package is absent from the /packages listing. That listing is paginated (~25/page default) and $top can't be appended here -- a second query param needs `&`, which az.cmd/cmd.exe splits -- so an empty set was indistinguishable from a pagination miss and silently rebuilt the whole batch on --continue. None takes the explicit "could not query feed" fallback instead. - _is_az_unavailable: narrow the broad 'refresh token' marker to 'invalid_grant' (the OAuth2 code Azure AD emits for an expired/revoked refresh token), so an informational MSAL "Refreshing token..." line on a transient exit-1 can no longer abort the entire run. - Document that --upload-skip-existing is the safe flag on a --continue resume: a timed-out upload may have committed server-side, so the retry's 409 should count as done, not failed (README example + flags table, the flag's help text, and the _classify_upload docstring). Adds unit tests for the narrowed marker and for _fetch_feed_versions not-found -> None / matching-versions / query-failure.
CodeQL flagged the best-effort 'remove now-empty model dir' cleanup in _run_build_only as an empty 'except OSError: pass'. Add an explanatory comment (no behaviour change) so the intent -- ignoring a non-empty/locked dir -- is documented.
Summary
Adds a
--build-onlymode toscripts/e2e_eval/run_eval.pythat generates ModelKit pipeline artifacts (export → optimize → quantize, no compile) across the full EP matrix, without requiring EP hardware, and optionally streams them to theModelkitAzure Artifacts feed while bounding local disk usage.Motivation: we need a way to mass-produce per-EP model artifacts for ~200 models on a single (CPU) box and distribute them, but (a) the build normally needs the target EP installed, and (b) writing every stage for every (model × EP) fills the disk fast.
What's included
1.
--build-onlymode + per-EP matrixwinml config+winml build --no-compileper model; perf/accuracy are skipped.--ep/--deviceare omitted, builds the eval EP matrix into<model_dir>/<ep>_<device>/subdirs:qnn_npu,qnn_gpu,ov_cpu,ov_npu,ov_gpu,mlas_cpu,dml_gpu,vitisai_npu. Pinning--ep/--devicewrites a single build directly into<model_dir>.w8a16, CPU/GPU → auto, native-quant EPs →--no-quant).2. Cross-EP / cross-host builds (core fix in
build.py)_run_optimize_stagecalledresolve_device(device, ep=ep)purely to pick a progress-bar key, which raised when the target EP wasn't installed locally — blocking offline generation of (e.g.) QNN/VitisAI artifacts on a CPU box.config.compile is None), the missing-EP lookup soft-fails and falls back to the requested device (optimize/quantize only need the EP's static rule data, not a registered EP). When compile will run, it still fails fast. No behavior change for normal compiling builds.3. Export dedup (disk saver)
export.onnxstage is EP/device-independent, so all 8 combos produce an identical export. After each combo builds, its export is hash-compared against a per-model canonical: the first is moved to<model_dir>/_shared/, later identical ones are deleted — one export copy instead of 8 (export is the largest, full-precision artifact).4.
--upload: stream to Azure Artifacts feed, then delete locallyModelkitfeed as a Universal Package, then deletes the local copy — peak disk stays at ~one model's matrix.az login(Entra ID), no PAT. Theazure-devopsextension is ensured and login verified up front; if not ready, the run aborts (so disk isn't silently filled).winml-cli-models, one version per model:0.0.0-<run-stamp>-<model-slug>(valid SemVer 2.0; the0.0.0-core keeps it legal, the date stamp + slug are the pre-release segment). The shared run-stamp groups a batch.--continue+ the same--run-stampskips already-uploaded models without rebuilding them. Already-uploaded models are detected two ways: a localbuild_only_uploads.jsonmanifest, and a query against the feed itself at startup (versions matching0.0.0-<run-stamp>-*). Because the manifest is only written after a successful upload, a fresh--output-dirwould otherwise start empty and rebuild everything; seeding from the feed makes it authoritative for what's published, so resume works regardless of local state. A feed-query failure falls back to local-manifest-only behavior.--run-stamp,--keep-local,--upload-skip-existing,--feed/--feed-org/--feed-project/--package-name.Usage
Download a specific model's specific file later:
az artifacts universal download \ --organization https://dev.azure.com/microsoft --project windows.ai.toolkit \ --scope project --feed Modelkit --name winml-cli-models \ --version 0.0.0-20260609-microsoft-resnet-50-image-classification \ --path ./out --file-filter 'qnn_npu/model.onnx*'Notes / scope
src/winml/modelkit/commands/build.pychange is gated onconfig.compile is None, so it only affects no-compile builds; compiling builds are unchanged and still fail fast on a missing EP.--continueresumes from the feed (versions matching0.0.0-<run-stamp>-*) and the localbuild_only_uploads.jsonmanifest, so a resumed run only needs the same--run-stamp— it no longer has to reuse the same--output-dir. The feed query uses two&-free REST GETs (list packages → resolve the UPack package GUID → list versions) becauseazresolves toaz.cmdand cmd.exe splits query strings on&, dropping every parameter after the first.Modelkit, and--file-filterdownload of individualmodel.onnxfiles.